This is my final project for my “Introduction to Data Science” class. Throughout this class we have learned various techniques for gathering, manipulating, and analyzing data. Towards the end of the class we explored machine learning and its application to data, including techniques such as random forests for forecasting.
Now for a bit on Bitcoin. Bitcoin was the first cryptocurrency. Cryptocurrencies are a relatively new technology with the potential to revolutionize the world. The origins of Bitcoin are unknown: the Bitcoin whitepaper was published under a pseudonym. Bitcoin’s price has broken all expectations. This past December it hit an all-time high of $19,783.06. Bitcoin has paved the way for various other cryptocurrencies, and daily cryptocurrency transaction volumes are rapidly increasing.
For my final project I will attempt to create a model to predict the future price of Bitcoin. This is a time series forecasting problem: a time series is a sequence of observations recorded over time, and forecasting uses past observations to predict future ones.
While researching, I came across an inherent problem with time series forecasting that both the reader and I should be aware of. A random forest is generally trained by randomly sampling the data and then estimating its error on the observations that were left out. With time series data you cannot simply do this, because random sampling discards the sequential structure of the data and can let future observations inform predictions about the past. So in order to build my random forest I must do so in a way that maintains the important sequential structure of the data. More on this here: https://stats.stackexchange.com/questions/14099/using-k-fold-cross-validation-for-time-series-model-selection
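To make the idea concrete, here is a minimal sketch of a chronological train/test split in base R. The data is synthetic and the names `prices`, `cutoff`, `train`, and `test` are my own choices for illustration; this is one simple way to respect time order, not the only one.

```r
# Toy daily series standing in for real price data (synthetic values).
set.seed(1)
prices <- data.frame(
  day   = seq(as.Date("2017-01-01"), by = "day", length.out = 100),
  close = 100 + cumsum(rnorm(100))
)

# Chronological split: the first 80% of days train, the last 20% test.
cutoff <- floor(0.8 * nrow(prices))
train  <- prices[seq_len(cutoff), ]
test   <- prices[(cutoff + 1):nrow(prices), ]

# Every training day precedes every test day, so no future data leaks in.
max(train$day) < min(test$day)
```

Unlike a random split, the test set here always lies in the "future" relative to the training set, which is the situation a forecasting model actually faces.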
Let's start by importing the necessary libraries for our code.
library(xts)
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(anytime)
library(ggplot2)
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(tidyquant)
## Loading required package: lubridate
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
## Loading required package: PerformanceAnalytics
##
## Attaching package: 'PerformanceAnalytics'
## The following object is masked from 'package:graphics':
##
## legend
## Loading required package: quantmod
## Loading required package: TTR
## Version 0.4-0 included new data defaults. See ?getSymbols.
## Loading required package: tidyverse
## ── Attaching packages ─────────────────────────────── tidyverse 1.2.1 ──
## ✔ tibble 1.4.2 ✔ purrr 0.2.4
## ✔ tidyr 0.8.0 ✔ dplyr 0.7.4
## ✔ readr 1.1.1 ✔ stringr 1.3.0
## ✔ tibble 1.4.2 ✔ forcats 0.3.0
## ── Conflicts ────────────────────────────────── tidyverse_conflicts() ──
## ✖ lubridate::as.difftime() masks base::as.difftime()
## ✖ lubridate::date() masks base::date()
## ✖ dplyr::filter() masks plotly::filter(), stats::filter()
## ✖ dplyr::first() masks xts::first()
## ✖ lubridate::intersect() masks base::intersect()
## ✖ dplyr::lag() masks stats::lag()
## ✖ dplyr::last() masks xts::last()
## ✖ lubridate::setdiff() masks base::setdiff()
## ✖ lubridate::union() masks base::union()
library(QuantTools)
## Loading required package: data.table
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
## between, first, last
## The following object is masked from 'package:purrr':
##
## transpose
## The following objects are masked from 'package:lubridate':
##
## hour, isoweek, mday, minute, month, quarter, second, wday,
## week, yday, year
## The following objects are masked from 'package:xts':
##
## first, last
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
First, we must gather and tidy the data. We obtain the data here: https://www.kaggle.com/vennaa/notebook-forecasting-bitcoins. Although this project takes some inspiration from this Kaggle notebook, we make several different design choices.
Let's first read in the file using the built-in read.csv function that R provides.
traindata = read.csv("bitcoinset.csv")
Let's list the first few rows:
head(traindata)
## Date Open High Low Close Volume
## 1 Jul 31, 2017 2763.24 2889.62 2720.61 2875.34 860,575,000
## 2 Jul 30, 2017 2724.39 2758.53 2644.85 2757.18 705,943,000
## 3 Jul 29, 2017 2807.02 2808.76 2692.80 2726.45 803,746,000
## 4 Jul 28, 2017 2679.73 2897.45 2679.73 2809.01 1,380,100,000
## 5 Jul 27, 2017 2538.71 2693.32 2529.34 2671.78 789,104,000
## 6 Jul 26, 2017 2577.77 2610.76 2450.80 2529.45 937,404,000
## Market.Cap
## 1 45,535,800,000
## 2 44,890,700,000
## 3 46,246,700,000
## 4 44,144,400,000
## 5 41,816,500,000
## 6 42,455,000,000
Let’s look at each feature of the data set and consider its potential relevance to price.
colnames(traindata)
## [1] "Date" "Open" "High" "Low" "Close"
## [6] "Volume" "Market.Cap"
So we see here that we have seven columns. I hypothesize that Open, High, Low, Close, Volume, and Market Cap may all potentially be useful for determining the future price. Having followed the crypto markets for a couple of years now, I know that price action generally follows large changes in volume, so I am interested to see what role volume plays in changes in price.
We need to tidy up our data so that it is useful for us. We need to convert the Volume and Market Cap columns to numeric values. Since we are working with a time series, we will also convert the Date column to a date-time (POSIXct) type.
traindata$Market.Cap <- as.numeric(traindata$Market.Cap)
traindata$Volume <- as.numeric(traindata$Volume)
traindata$Date <- as.POSIXct(as.Date(anytime(traindata$Date)))
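One caveat about the numeric conversion: the raw Volume and Market Cap values contain thousands separators (for example 860,575,000). If read.csv stored these columns as factors, which was its default before R 4.0, calling as.numeric on them returns internal level codes rather than the real magnitudes. The sketch below shows the difference on a few hypothetical values; the vector `raw_volume` and the "-" placeholder for a missing entry are assumptions for illustration, not taken from the file.

```r
# Hypothetical raw values in the same style as the CSV columns shown above.
raw_volume <- c("860,575,000", "705,943,000", "-")

# Converting a factor directly returns its internal level codes, not the numbers.
as.numeric(factor(raw_volume))

# Going through character and removing the commas recovers the real magnitudes.
# Entries that are not numbers (the assumed "-" placeholder here) become NA.
clean_volume <- suppressWarnings(as.numeric(gsub(",", "", raw_volume)))
clean_volume
```

With this kind of conversion, missing volumes show up as NA, which is what an is.na-based fill step needs in order to find them.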
Upon further examination of this data, we see that the Volume column is missing some values. The Kaggle notebook that this project took inspiration from chose to omit the Volume column from its forecasting. I would like to use the Volume column, as I believe it is relevant to price action, so we must somehow fill in the missing values. I found a similar Kaggle notebook that does the same (https://www.kaggle.com/ara0303/forecasting-of-bitcoin-prices/code). It filled in the missing volume by utilizing the high and low price of the day. Generally, large price swings would indicate high volume. Although I have some qualms with this method, it is the best I could find. Not many of the dates are missing volume, so the inaccuracy of these entries should not affect the entire model greatly.
The approach taken in the above Kaggle notebook finds the difference between the high and low of the day and adds it as a column. It then sorts each day into a tier based on that difference and fills in a missing volume with the average volume of the days in the same tier, computed from the available data. I follow a similar approach, using four tiers: difference < 50, 50 to 100, 100 to 150, and 150 to 350.
# Daily trading range (High minus Low), used to sort days into tiers.
traindata$diff <- traindata$High - traindata$Low
# Average observed volume within each range tier (missing volumes are ignored).
fifty_avg <- round(mean(traindata$Volume[traindata$diff < 50], na.rm = TRUE), digits = 2)
hun_avg <- round(mean(traindata$Volume[traindata$diff >= 50 & traindata$diff < 100], na.rm = TRUE), digits = 2)
hf_avg <- round(mean(traindata$Volume[traindata$diff >= 100 & traindata$diff < 150], na.rm = TRUE), digits = 2)
th_avg <- round(mean(traindata$Volume[traindata$diff >= 150 & traindata$diff < 350], na.rm = TRUE), digits = 2)
# Fill each missing volume with the average for that day's range tier.
for (i in seq_len(nrow(traindata))) {
  if (is.na(traindata$Volume[i])) {
    if (traindata$diff[i] < 50) {
      traindata$Volume[i] <- fifty_avg
    } else if (traindata$diff[i] < 100) {
      traindata$Volume[i] <- hun_avg
    } else if (traindata$diff[i] < 150) {
      traindata$Volume[i] <- hf_avg
    } else if (traindata$diff[i] < 350) {
      traindata$Volume[i] <- th_avg
    } else {
      print("Uncaught Title")
    }
  }
}
# Drop the helper column again.
traindata$diff <- NULL
head(traindata)
## Date Open High Low Close Volume Market.Cap
## 1 2017-07-30 20:00:00 2763.24 2889.62 2720.61 2875.34 1245 946
## 2 2017-07-29 20:00:00 2724.39 2758.53 2644.85 2757.18 1129 942
## 3 2017-07-28 20:00:00 2807.02 2808.76 2692.80 2726.45 1205 949
## 4 2017-07-27 20:00:00 2679.73 2897.45 2679.73 2809.01 28 937
## 5 2017-07-26 20:00:00 2538.71 2693.32 2529.34 2671.78 1187 920
## 6 2017-07-25 20:00:00 2577.77 2610.76 2450.80 2529.45 1286 925
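The tiered fill described above can also be written without an explicit loop. The following is a minimal sketch on a made-up five-row data frame; the name `toy`, its values, and the helper columns `range` and `tier` are all hypothetical and only illustrate the technique.

```r
# Made-up example: one missing Volume in the lowest range tier.
toy <- data.frame(
  High   = c(110, 180, 140, 230, 120),
  Low    = c(100, 100, 100, 100, 100),
  Volume = c(10, 40, NA, 50, 20)
)

# Daily range and its tier: [0,50), [50,100), [100,150), [150,350).
toy$range <- toy$High - toy$Low
toy$tier  <- cut(toy$range, breaks = c(0, 50, 100, 150, 350), right = FALSE)

# Mean of the known volumes within each tier.
tier_means <- tapply(toy$Volume, toy$tier, mean, na.rm = TRUE)

# Replace each missing volume with the mean of its own tier.
is_missing <- is.na(toy$Volume)
toy$Volume[is_missing] <- tier_means[as.character(toy$tier[is_missing])]
toy$Volume
```

Here the third row has a range of 40, so it receives the average of the other two days in the lowest tier (volumes 10 and 20).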
Let's reorder the data in ascending order by date:
traindata <- traindata %>% arrange(Date)
head(traindata)
## Date Open High Low Close Volume Market.Cap
## 1 2013-04-27 20:00:00 135.30 135.98 132.10 134.21 1 130
## 2 2013-04-28 20:00:00 134.44 147.49 134.00 144.54 1 125
## 3 2013-04-29 20:00:00 144.00 146.93 134.05 139.00 1 158
## 4 2013-04-30 20:00:00 139.00 139.89 107.72 116.99 1 142
## 5 2013-05-01 20:00:00 116.38 125.60 92.28 105.21 1 75
## 6 2013-05-02 20:00:00 106.25 108.13 79.10 97.75 1 37
Exploratory Data Analysis
Now that we have cleaned up our data we can move on to analyze the data. Data analysis can be used to get a general idea of how our data is distributed and for spotting obvious trends. For more on this check out: http://www.hcbravo.org/IntroDataSci/bookdown-notes/part-exploratory-data-analysis.html. Utilizing the summary function in R we can learn more about the different features of our data.
summary(traindata)
## Date Open High
## Min. :2013-04-27 20:00:00 Min. : 68.5 Min. : 74.56
## 1st Qu.:2014-05-21 14:00:00 1st Qu.: 254.3 1st Qu.: 260.33
## Median :2015-06-14 08:00:00 Median : 438.6 Median : 447.56
## Mean :2015-06-14 08:00:00 Mean : 582.6 Mean : 597.99
## 3rd Qu.:2016-07-07 02:00:00 3rd Qu.: 662.4 3rd Qu.: 674.52
## Max. :2017-07-30 20:00:00 Max. :2953.2 Max. :2999.91
## Low Close Volume Market.Cap
## Min. : 65.53 Min. : 68.43 Min. : 1.0 Min. : 1.0
## 1st Qu.: 248.84 1st Qu.: 254.32 1st Qu.: 147.8 1st Qu.: 388.8
## Median : 430.57 Median : 438.86 Median : 536.5 Median : 774.5
## Mean : 567.85 Mean : 584.24 Mean : 555.4 Mean : 775.7
## 3rd Qu.: 646.74 3rd Qu.: 663.40 3rd Qu.: 925.2 3rd Qu.:1163.2
## Max. :2840.53 Max. :2958.11 Max. :1314.0 Max. :1552.0
From this we see the volatility of Bitcoin’s price. There is a clear up-trend. Let's visualize Bitcoin’s price using ggplot.
ggplot(traindata, aes(as.Date(Date), Close)) + geom_line() + ylab("Closing Price") + xlab("Date")
Traders often use candlestick charts when trading. A candlestick chart shows the open, high, low, and close for each day, and it is widely used because it displays all of this information in one chart. More on this here: https://www.investopedia.com/terms/c/candlestick.asp. Let's use a different R library, plotly, to draw one.
p <- traindata %>%
  plot_ly(x = ~Date, type = "candlestick",
          open = ~Open, close = ~Close,
          high = ~High, low = ~Low) %>%
  layout(title = "Bitcoin Candlestick Chart",
         xaxis = list(rangeslider = list(visible = FALSE)))
p
This produces an interactive, zoomable chart that gives a better sense of the data for a particular time period.
Another technical indicator widely used in stock and crypto trading is the MACD, which stands for Moving Average Convergence Divergence. The MACD is calculated by subtracting one exponential moving average of the price from another, conventionally the 26-day average from the 12-day average. A crossover of the two exponential moving averages signals a buy or a sell. When the MACD moves far from zero, this indicates overbought or oversold territory. Let's see what this looks like on the chart.
ggplot(traindata, aes(as.Date(Date), Close)) + geom_line() + ylab("Closing Price") + xlab("Date") +
  # 26-day and 12-day exponential moving averages, as used by the MACD
  geom_ma(ma_fun = EMA, n = 26) +
  geom_ma(ma_fun = EMA, n = 12, color = "red")
MACD is so widely traded on that it would be interesting to include it as a feature in our dataset, so that we can use it in our machine learning algorithms later on. To do this I calculate the difference between the 26-day and 12-day exponential moving averages and add it as a column. Note that this is the 26-day average minus the 12-day average, which has the opposite sign of the conventional MACD (12-day minus 26-day). These parameters can be played with later on.
# 26-day EMA minus 12-day EMA of the closing price, stored as a new column.
traindata$MACD <- ema(traindata$Close, 26) - ema(traindata$Close, 12)
head(traindata)
## Date Open High Low Close Volume Market.Cap MACD
## 1 2013-04-27 20:00:00 135.30 135.98 132.10 134.21 1 130 NA
## 2 2013-04-28 20:00:00 134.44 147.49 134.00 144.54 1 125 NA
## 3 2013-04-29 20:00:00 144.00 146.93 134.05 139.00 1 158 NA
## 4 2013-04-30 20:00:00 139.00 139.89 107.72 116.99 1 142 NA
## 5 2013-05-01 20:00:00 116.38 125.60 92.28 105.21 1 75 NA
## 6 2013-05-02 20:00:00 106.25 108.13 79.10 97.75 1 37 NA
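To show what an exponential moving average and a MACD line actually compute, here is a small self-contained sketch in base R. The helper `ema_sketch` is hypothetical and is not the QuantTools implementation used above: it seeds the average with the first value instead of returning NA for the warm-up period, and it uses the common smoothing factor 2 / (n + 1). The prices are synthetic.

```r
# Hypothetical helper (not the QuantTools implementation): an exponential
# moving average with smoothing factor 2 / (n + 1), seeded with the first value.
ema_sketch <- function(x, n) {
  alpha <- 2 / (n + 1)
  out <- numeric(length(x))
  out[1] <- x[1]
  for (i in seq_along(x)[-1]) {
    out[i] <- alpha * x[i] + (1 - alpha) * out[i - 1]
  }
  out
}

# Synthetic closing prices: flat for 30 days, then a steady climb.
close <- c(rep(100, 30), 100 + 2 * (1:30))

# Conventional MACD line: fast (12-day) EMA minus slow (26-day) EMA.
macd_line <- ema_sketch(close, 12) - ema_sketch(close, 26)

# Essentially zero while the price is flat, positive once the fast EMA pulls ahead.
range(macd_line[1:30])
tail(macd_line, 1) > 0
```

Because the fast average reacts to new prices more quickly than the slow one, the conventional MACD line turns positive in an up-trend and negative in a down-trend.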
Predictive Modeling
Now on to the fun part. Let’s see what different ways we can try to accomplish the seemingly impossible task of forecasting the price of Bitcoin. I had a few initial ideas on how to do this. A simple linear regression model could be used. This is a form of regression analysis that attempts to find a linear relationship between a response variable and one or more predictor variables. In this case the predictors would come from the columns of our dataset. A model like this seems rather easy to create, so let's check it out.
Let's see if there is a model where we can use the MACD attribute we calculated to forecast the closing price of Bitcoin: Closing Price = B0 + B1 * MACD.
Let's use R’s lm function, which is used to fit linear models.
mod = lm(traindata$Close ~ scale(traindata$MACD, center=TRUE, scale=FALSE), data = traindata)
summary(mod)
##
## Call:
## lm(formula = traindata$Close ~ scale(traindata$MACD, center = TRUE,
## scale = FALSE), data = traindata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1010.8 -261.9 -91.3 139.2 2626.2
##
## Coefficients:
## Estimate Std. Error
## (Intercept) 591.8440 10.8337
## scale(traindata$MACD, center = TRUE, scale = FALSE) -7.4801 0.2591
## t value Pr(>|t|)
## (Intercept) 54.63 <2e-16 ***
## scale(traindata$MACD, center = TRUE, scale = FALSE) -28.87 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 423.9 on 1529 degrees of freedom
## (25 observations deleted due to missingness)
## Multiple R-squared: 0.3529, Adjusted R-squared: 0.3524
## F-statistic: 833.7 on 1 and 1529 DF, p-value: < 2.2e-16
qplot(traindata$MACD, traindata$Close, data = traindata, main = "Relationship between MACD and Bitcoin Closing Price") +
stat_smooth(method="lm", col="red")
## Warning: Removed 25 rows containing non-finite values (stat_smooth).
## Warning: Removed 25 rows containing missing values (geom_point).
I centered MACD (subtracted its mean) before fitting, in an attempt to account for the large spread in Bitcoin price. Note that centering a predictor only changes the intercept, which becomes the predicted closing price at the average MACD; the slope and R2 are unaffected, so centering does not by itself make the model more accurate. Looking at the scatterplot, the points do not follow the fitted line closely. Although we have a small p-value, our R2 is a mere 0.35. This indicates that only about 35% of the variance observed in price can be explained by the MACD.
Accurately predicting the price of Bitcoin seems to be challenging. Let's instead see if we can do a binary classification and predict the direction of the price from day to day. In order to do this we must add another feature to our dataset: a boolean that is FALSE if the price went down over the day (the close was at or below the open) and TRUE if the price went up. This could potentially be used later on with a trading bot, where TRUE signifies a buy signal and FALSE a sell signal.
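As a quick check of what centering a predictor does to a linear model, the sketch below fits the same regression with and without centering. The data is synthetic (not the Bitcoin set), and the variable names are my own.

```r
# Synthetic data: y depends linearly on x plus noise.
set.seed(42)
x <- rnorm(200, mean = 30, sd = 10)
y <- 5 - 2 * x + rnorm(200, sd = 15)

raw_fit      <- lm(y ~ x)
x_centered   <- x - mean(x)
centered_fit <- lm(y ~ x_centered)

# The slope and R-squared agree between the two fits; only the intercept
# differs, and in the centered fit it equals the mean of y.
coef(raw_fit)
coef(centered_fit)
summary(raw_fit)$r.squared
summary(centered_fit)$r.squared
```

So centering is useful for interpreting the intercept, but it leaves the quality of the fit unchanged.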
# TRUE when the day closed above its open, FALSE otherwise.
traindata$diff <- traindata$Close - traindata$Open > 0
head(traindata)
## Date Open High Low Close Volume Market.Cap MACD
## 1 2013-04-27 20:00:00 135.30 135.98 132.10 134.21 1 130 NA
## 2 2013-04-28 20:00:00 134.44 147.49 134.00 144.54 1 125 NA
## 3 2013-04-29 20:00:00 144.00 146.93 134.05 139.00 1 158 NA
## 4 2013-04-30 20:00:00 139.00 139.89 107.72 116.99 1 142 NA
## 5 2013-05-01 20:00:00 116.38 125.60 92.28 105.21 1 75 NA
## 6 2013-05-02 20:00:00 106.25 108.13 79.10 97.75 1 37 NA
## diff
## 1 FALSE
## 2 TRUE
## 3 FALSE
## 4 FALSE
## 5 FALSE
## 6 FALSE
Now let's see if we can create a random forest that uses MACD and Volume to predict the direction of price (in a real-life trading algorithm we would probably use the change in volume instead). A random forest is an ensemble of decision trees: each tree is trained on a random bootstrap sample of the data and considers a random subset of the features at each split, and the forest combines the trees' votes to make a more accurate model.
traindata$diff <- as.character(traindata$diff)
traindata$diff <- as.factor(traindata$diff)
output.forest <- randomForest(traindata$diff ~ MACD + Volume,
data = traindata, na.action = na.omit)
# View the forest results.
print(output.forest)
##
## Call:
## randomForest(formula = traindata$diff ~ MACD + Volume, data = traindata, na.action = na.omit)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 1
##
## OOB estimate of error rate: 49.44%
## Confusion matrix:
## FALSE TRUE class.error
## FALSE 272 432 0.6136364
## TRUE 325 502 0.3929867
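The confusion matrix above can be turned into a few summary numbers with simple arithmetic. The sketch below copies the four counts from the output; the variable names are my own. It also computes a naive baseline, the accuracy of always predicting the more common class, for comparison.

```r
# Counts copied from the confusion matrix above (rows = actual, columns = predicted).
cm <- matrix(c(272, 432,
               325, 502),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("FALSE", "TRUE"),
                             predicted = c("FALSE", "TRUE")))

# Share of days classified correctly, and its complement (the OOB error rate).
accuracy  <- sum(diag(cm)) / sum(cm)
oob_error <- 1 - accuracy

# Accuracy of always predicting the more common actual class.
majority_baseline <- max(rowSums(cm)) / sum(cm)

round(c(accuracy = accuracy, oob_error = oob_error, baseline = majority_baseline), 4)
```

The forest classifies about 50.6% of days correctly, while always predicting TRUE would be right on about 54% of these days, so the model does not beat the naive baseline.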
From the above output we can see that our model is extremely inaccurate: an out-of-bag error rate of 49.44% is about what a coin flip would achieve. This is probably due to a multitude of factors. Raw volume is most likely a poor indicator of price direction. Instead we probably need to look at the change in volume from the previous day.
Generally, random forests are used with far more features. It would be interesting to see how adding more features would affect the accuracy of the model.
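As a starting point for that idea, here is a minimal sketch of computing the day-over-day change in volume in base R. The `volume` vector is hypothetical and is assumed to be ordered from oldest to newest.

```r
# Hypothetical daily volumes, oldest first.
volume <- c(100, 150, 120, 180)

# Absolute change from the previous day; the first day has no predecessor.
volume_change <- c(NA, diff(volume))

# Relative change, which is comparable across low- and high-volume periods.
volume_pct_change <- volume_change / c(NA, head(volume, -1))

volume_change
volume_pct_change
```

Either column could be added to the dataset as a feature in place of raw volume.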
Additional Reading:
The above article talks about the efficacy of using Bayesian regression for predicting the price of Bitcoin. The study, done at MIT, shows that Bayesian regression is successful at predicting Bitcoin price changes. One idea I had was to create a random forest in which each node uses Bayesian regression and a multitude of factors, including sentiment analysis, to build a better model. This could be a potential topic for future research.